Goto

Collaborating Authors

 Williams County


Selective Masking based Self-Supervised Learning for Image Semantic Segmentation

Wang, Yuemin, Stavness, Ian

arXiv.org Artificial Intelligence

This paper proposes a novel self-supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model's knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low-budget self-supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.


SPOT: Single-Shot Positioning via Trainable Near-Field Rainbow Beamforming

Cai, Yeyue, Mo, Jianhua, Tao, Meixia

arXiv.org Artificial Intelligence

Abstract--Phase-time arrays, which integrate phase shifters (PSs) and true-time delays (TTDs), have emerged as a cost-effective architecture for generating frequency-dependent rainbow beams in wideband sensing and localization. This paper proposes an end-to-end deep learning-based scheme that simultaneously designs the rainbow beams and estimates user positions. Treating the PS and TTD coefficients as trainable variables allows the network to synthesize task-oriented beams that maximize localization accuracy. A lightweight fully connected module then recovers the user's angle-range coordinates from its feedback of the maximum quantized received power and its corresponding subcarrier index after a single downlink transmission. Compared with existing analytical and learning-based schemes, the proposed method reduces overhead by an order of magnitude and delivers consistently lower two-dimensional positioning error .


Predicting Oscar-Nominated Screenplays with Sentence Embeddings

Gross, Francis

arXiv.org Artificial Intelligence

Oscar nominations are an important factor in the movie industry because they can boost both the visibility and the commercial success. This work explores whether it is possible to predict Oscar nominations for screenplays using modern language models. Since no suitable dataset was available, a new one called Movie-O-Label was created by combining the MovieSum collection of movie scripts with curated Oscar records. Each screenplay was represented by its title, Wikipedia summary, and full script. Long scripts were split into overlapping text chunks and encoded with the E5 sentence em bedding model. Then, the screenplay embed dings were classified using a logistic regression model. The best results were achieved when three feature inputs related to screenplays (script, summary, and title) were combined. The best-performing model reached a macro F1 score of 0.66, a precision recall AP of 0.445 with baseline 0.19 and a ROC-AUC of 0.79. The results suggest that even simple models based on modern text embeddings demonstrate good prediction performance and might be a starting point for future research.


On the transferability of Sparse Autoencoders for interpreting compressed models

Gupte, Suchit, Chhabra, Vishnu Kabir, Khalili, Mohammad Mahdi

arXiv.org Artificial Intelligence

Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model albeit with slight performance degradation compared to the trained SAE on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.


Sugar-Beet Stress Detection using Satellite Image Time Series

Sadbhave, Bhumika Laxman, Vaeth, Philipp, Dejon, Denise, Schorcht, Gunther, Gregorová, Magda

arXiv.org Artificial Intelligence

Satellite Image Time Series (SITS) data has proven effective for agricultural tasks due to its rich spectral and temporal nature. In this study, we tackle the task of stress detection in sugar-beet fields using a fully unsupervised approach. We propose a 3D convolutional au-toencoder model to extract meaningful features from Sentinel-2 image sequences, combined with acquisition-date-specific temporal encodings to better capture the growth dynamics of sugar-beets. The learned representations are used in a downstream clustering task to separate stressed from healthy fields. The resulting stress detection system can be directly applied to data from different years, offering a practical and accessible tool for stress detection in sugar-beets.


MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

Shan, Xiaojun, Cao, Qi, Han, Xing, Yu, Haofei, Liang, Paul Pu

arXiv.org Artificial Intelligence

Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to ever-larger datasets in both quantity and scale, our findings reveal that simply increasing the number of instruction-tuning tasks does not consistently yield better performance. Instead, we observe that grouping tasks by the common interactions across modalities, such as discovering redundant shared information, prioritizing modality selection with unique information, or requiring synergistic fusion to discover new information from both modalities, encourages the models to learn transferrable skills within a group while suppressing interference from mismatched tasks. To this end, we introduce MINT, a simple yet surprisingly effective task-grouping strategy based on the type of multimodal interaction. We demonstrate that the proposed method greatly outperforms existing task grouping baselines for multimodal instruction tuning, striking an effective balance between generalization and specialization.


SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets

Croonen, Gerardus, Trondl, Andreas, Simon, Julia, Steininger, Daniel

arXiv.org Artificial Intelligence

While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aide in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.


Field Robot for High-throughput and High-resolution 3D Plant Phenotyping

Esser, Felix, Rosu, Radu Alexandru, Cornelißen, André, Klingbeil, Lasse, Kuhlmann, Heiner, Behnke, Sven

arXiv.org Artificial Intelligence

With the need to feed a growing world population, the efficiency of crop production is of paramount importance. To support breeding and field management, various characteristics of the plant phenotype need to be measured -- a time-consuming process when performed manually. We present a robotic platform equipped with multiple laser and camera sensors for high-throughput, high-resolution in-field plant scanning. We create digital twins of the plants through 3D reconstruction. This allows the estimation of phenotypic traits such as leaf area, leaf angle, and plant height. We validate our system on a real field, where we reconstruct accurate point clouds and meshes of sugar beet, soybean, and maize.


Crop Row Switching for Vision-Based Navigation: A Comprehensive Approach for Efficient Crop Field Navigation

de Silva, Rajitha, Cielniak, Grzegorz, Gao, Junfeng

arXiv.org Artificial Intelligence

Vision-based mobile robot navigation systems in arable fields are mostly limited to in-row navigation. The process of switching from one crop row to the next in such systems is often aided by GNSS sensors or multiple camera setups. This paper presents a novel vision-based crop row-switching algorithm that enables a mobile robot to navigate an entire field of arable crops using a single front-mounted camera. The proposed row-switching manoeuvre uses deep learning-based RGB image segmentation and depth data to detect the end of the crop row, and re-entry point to the next crop row which would be used in a multi-state row switching pipeline. Each state of this pipeline use visual feedback or wheel odometry of the robot to successfully navigate towards the next crop row. The proposed crop row navigation pipeline was tested in a real sugar beet field containing crop rows with discontinuities, varying light levels, shadows and irregular headland surfaces. The robot could successfully exit from one crop row and re-enter the next crop row using the proposed pipeline with absolute median errors averaging at 19.25 cm and 6.77{\deg} for linear and rotational steps of the proposed manoeuvre.


Deep learning-based Crop Row Detection for Infield Navigation of Agri-Robots

de Silva, Rajitha, Cielniak, Grzegorz, Wang, Gang, Gao, Junfeng

arXiv.org Artificial Intelligence

Autonomous navigation in agricultural environments is challenged by varying field conditions that arise in arable fields. State-of-the-art solutions for autonomous navigation in such environments require expensive hardware such as RTK-GNSS. This paper presents a robust crop row detection algorithm that withstands such field variations using inexpensive cameras. Existing datasets for crop row detection does not represent all the possible field variations. A dataset of sugar beet images was created representing 11 field variations comprised of multiple grow stages, light levels, varying weed densities, curved crop rows and discontinuous crop rows. The proposed pipeline segments the crop rows using a deep learning-based method and employs the predicted segmentation mask for extraction of the central crop using a novel central crop row selection algorithm. The novel crop row detection algorithm was tested for crop row detection performance and the capability of visual servoing along a crop row. The visual servoing-based navigation was tested on a realistic simulation scenario with the real ground and plant textures. Our algorithm demonstrated robust vision-based crop row detection in challenging field conditions outperforming the baseline.